U-Flow: A U-shaped Normalizing Flow for Anomaly Detection with Unsupervised Threshold

Matías Tailanian (Digital Sense & Universidad de la República, Uruguay), Álvaro Pardo (Digital Sense & Universidad Católica del Uruguay), Pablo Musé (Universidad de la República, Uruguay)

Abstract

In this work we propose a non-contrastive method for anomaly detection and segmentation in images, which benefits from both a modern machine learning approach and more classic statistical detection theory. The method consists of three phases. First, features are extracted using a multi-scale image Transformer architecture. Then, these features are fed into a U-shaped Normalizing Flow that lays the theoretical foundations for the last phase, which computes a pixel-level anomaly map and performs a segmentation based on the a contrario framework. This multiple hypothesis testing strategy makes it possible to derive a robust automatic detection threshold, which is key in many real-world applications where an operational point is needed. The segmentation results are evaluated using the Intersection over Union (IoU) metric, and to assess the generated anomaly maps we report the area under the Receiver Operating Characteristic curve (ROC-AUC) at both image and pixel level. For both metrics, the proposed approach produces state-of-the-art results, ranking first in most MvTec-AD categories, with a mean pixel-level ROC-AUC of 98.74%.

Code and trained models are available at https://github.com/mtailanian/uflow.

1 Introduction

Figure 1: Anomalies detected with the proposed approach, on MvTec-AD examples from different categories. Top row: original images with ground truth segmentation. Middle and bottom rows: corresponding anomaly maps and automatic segmentations.

The detection of anomalies in images is a long-standing problem that has been studied for decades. In recent years this problem has received growing interest from the computer vision community, motivated by applications in a wide range of fields, from surveillance and security to health care. But one of the most common applications, and the one we are especially interested in, is automating the quality control of products in an industrial environment. In this case, it is often very hard (or sometimes even infeasible) to collect and label a sufficient amount of data representing all kinds of anomalies, since by definition these are rare structures. For this reason, the main effort in anomaly detection has focused on unsupervised and self-supervised non-contrastive learning, where only normal samples (i.e. anomaly-free images) are required for training.

Currently, the most common approach for anomaly detection consists in embedding the training images in some latent space, and then evaluating how far a test image is from the manifold of normal images [cohen2020sub-spade, defard2021padim, tsai2022multi-mspba, yi2020patch-svdd, roth2022towards-patchcore, lee2022cfa].

Other approaches focus on learning the probability density function of the training set. Among these, three types of generative models became popular in anomaly detection in the last five years. On one hand, Generative Adversarial Networks [goodfellow2014generative, anogan] learn this probability implicitly, and can only sample from it. On the other hand, Variational Auto-Encoders [kingma2013auto, yang2020dfr] can explicitly estimate this probability density, but only as an approximation, since they maximize the evidence lower bound. Finally, Normalizing Flows [dinh2014nice, dinh2016density-nvp, kingma2018glow] are able to explicitly learn the exact probability density function. This has several advantages, as it allows one to estimate the likelihood and score how probable it is that the tested data belongs to the training distribution. A major advantage, which we exploit in this work, is the possibility of developing formal statistical tests to derive unsupervised detection thresholds. This is a key feature for most real-world problems, where a segmentation of the anomaly is needed. To this end, we propose a multiple hypothesis testing strategy based on the a contrario framework [desolneux2007gestalt].

In this work we propose U-Flow, a new self-supervised non-contrastive method that ensembles the features extracted by multi-scale image transformers into a U-shaped Normalizing Flow architecture, and applies the a contrario methodology to find an automatic segmentation of anomalies in images. Our work achieves state-of-the-art results, substantially outperforming previous works in terms of segmentation IoU, and exhibiting top performance in localization metrics in terms of ROC-AUC. The main contributions of this paper are:

We propose a novel feature extractor, based on image Transformers combined in a multi-scale configuration.

We leverage the well-known advantages of U-like architectures in the setting of Normalizing Flows (NF).

Thanks to an NF that captures information at multiple image scales, we are able to provide a multi-scale characterization of the space of anomaly-free training images, as a latent space where both intra- and inter-scale features are statistically independent.

We propose a multiple hypothesis testing strategy that leads to an automatic detection threshold to segment anomalies, outperforming state-of-the-art methods.

In short, we propose an easy-to-train method that does not require any parameter tuning or modification of the architecture (as opposed to other state-of-the-art methods), and that is fast, accurate, and provides an unsupervised anomaly segmentation. Example results are shown in Figure 1.

The remainder of this paper is organized as follows. In Section 2 we discuss previous work related to the proposed approach. Details of the method are presented in Section 3. In Section 4 we present the results of our anomaly detection method, along with extensive comparisons to the state of the art. An ablation study that analyzes the role of each component of the proposed architecture is presented in Section 5. We conclude in Section 6.

2 Related work

The literature on anomaly detection is very extensive, and many approaches have been proposed. In this work we focus on non-contrastive self-supervised learning, where the key element is to learn only from anomaly-free images. Within this group, methods can be further divided into several categories.

Representation-based methods proceed by embedding samples into a latent space and measuring some distance between them. For example, in CFA [lee2022cfa] the authors build features using a patch descriptor and a scalable memory bank. A special loss function is designed to map features to a hyper-sphere and densely cluster the representation of normal images. The method can also incorporate a few abnormal samples during training, used to enlarge the distance to the normal samples in the latent space. Similarly, in [yi2020patch-svdd] the authors also construct a hierarchical encoding of image patches and encourage the network to map them to the center of a hyper-sphere in feature space. This work later inspired [tsai2022multi-mspba], where the authors aim to learn more representative features from normal images, using three encoders with three classifiers. In SPADE [cohen2020sub-spade] the anomaly score is obtained by measuring the distance of a test sample to the nearest neighbors (NN) computed over the anomaly-free images at training time. PaDiM [defard2021padim] models normality by fitting a multivariate Gaussian distribution after embedding image patches with a pre-trained CNN, and bases the detection on the Mahalanobis distance to this normality model. PatchCore [roth2022towards-patchcore] further improves SPADE and PaDiM by estimating anomalies with a larger nominal context, making the method less reliant on image alignment.

Reconstruction-based methods ground anomaly detection in the analysis of the reconstruction error. By using an encoder-decoder architecture such as [yang2020dfr], or a teacher-student scheme like [yamada2022reconstructed-rstpm], and training the network with only anomaly-free images, the network is expected to accurately reconstruct normal images but fail to reconstruct anomalies, as it has never seen them. Learning from one-class data leaves some uncertainty about how the network will behave on anomalous data. To overcome this potential issue, some works convert the problem into a self-supervised one by creating a proxy task. For example, in [li2021cutpaste] the authors obtain a representation of the images by randomly cutting and pasting regions over anomaly-free images to synthetically generate anomalies.

Figure 2: The method consists of three phases. (1) Multi-scale feature extraction: a rich multi-scale representation is obtained by combining pre-trained image Transformers acting at different image scales. (2) U-shaped Normalizing Flow: by adapting the widely used U-like architecture to NFs, a fully invertible architecture is designed. This architecture is capable of merging the information from different scales, while ensuring independence both intra- and inter-scale. To make it fully invertible, split and invertible up-sampling operations are used. (3) Anomaly score and segmentation computation: besides generating the anomaly map based on the likelihood of test data, we also propose to adapt the a contrario framework to obtain an automatic threshold by controlling the allowed number of false alarms.

Regarding the use of Normalizing Flows in anomaly detection, a few methods have been recently proposed [rudolph2021same-differnet, yu2021fastflow, gudovskiy2022cflow], among which CFlow [gudovskiy2022cflow] and FastFlow [yu2021fastflow] stand out for their impressive results. The former uses a one-dimensional NF and includes the spatial information via a positional encoding used as a conditional vector, while the latter directly uses a two-dimensional NF. In both cases, an anomaly score is computed based on the likelihood estimation of test data. In this work, we further improve on these methods by proposing a multi-scale Transformer-based feature extractor, and by equipping the NFs with a fully invertible U-shaped architecture. These two ingredients allow us to embed the multi-scale features in a latent space where intra- and inter-scale features are guaranteed to be independent and identically distributed Gaussian random variables. From these statistical properties, we can easily design a detection methodology that leads to statistically meaningful anomaly scores and unsupervised detection thresholds.

3 Method

The proposed method is depicted in Figure 2, and consists of three main phases: 1) Feature extraction, 2) U-shaped Normalizing Flow, and 3) Anomaly score and segmentation. The three phases are presented in the following.

3.1 Phase 1: Feature extraction

Because anomalies can emerge in a variety of sizes and forms, it is essential to collect image information at multiple scales. To do this, the standard deep learning strategy is to use pre-trained CNNs, often a VGG [vgg] or a variant of the ResNet [resnet] architectures, to extract a rich image feature representation. More recently, with the development of vision Transformers, architectures such as ViT [vit] and CaIT [touvron2021going-cait] are also being used, since the features they generate better compress all multi-scale information into a single feature map volume.

In this work, we construct a multi-scale Transformer architecture to further enhance these representations. More specifically, we employ two CaITs at different scales, independently pre-trained on ImageNet [imagenet]. The input size of our method matches the highest-resolution CaIT, and the images are down-sampled by a factor of two before being fed into the lower-resolution Transformer. The fact that both CaIT networks were trained independently may even be beneficial, as it might give the network more flexibility to attend to different structures in each Transformer.

We clearly show the benefits of the proposed multi-scale CaIT in the ablation study of Section 5, where the results obtained are superior to those of each Transformer alone, and far better than those of other ResNet variants. Although the architecture is presented as agnostic to the feature extractor, it is important to remark that all results in Section 4 were obtained using the exact same architecture.

3.2 Phase 2: Normalizing Flow

The rationale for using NFs in an anomaly detection setting is straightforward. The network is trained using only anomaly-free images, and in this way it learns to transform the distribution of normal images into a white Gaussian noise process. At test time, when the network is fed with an anomalous image, it is expected that it will not generate a sample with a high likelihood according to the white Gaussian noise model. Hence, a low likelihood indicates the presence of an anomaly.
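The likelihood scoring this rationale relies on follows directly from the change-of-variables formula. As a minimal sketch (plain NumPy, with hypothetical inputs; not the authors' implementation), given the latent z produced by the flow and the accumulated log-determinant of the flow's Jacobian:

```python
import numpy as np

def flow_log_likelihood(z, log_det_jacobian):
    """Log-likelihood of an input under a normalizing flow.

    z: latent produced by the flow (standard normal under the normality model)
    log_det_jacobian: log |det dz/dx| accumulated over the flow's layers
    """
    # log of the standard-normal density, summed over all latent entries
    log_pz = -0.5 * (z ** 2 + np.log(2 * np.pi)).sum()
    return log_pz + log_det_jacobian

# A latent far from the origin (e.g. produced by an anomalous input)
# scores a lower likelihood than a typical normal latent:
rng = np.random.default_rng(0)
normal_z = rng.standard_normal(16)
anomalous_z = normal_z + 3.0
assert flow_log_likelihood(anomalous_z, 0.0) < flow_log_likelihood(normal_z, 0.0)
```

The log-determinant term is what makes this an exact density on the input space rather than just a score on the latent.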

This second phase is the only one that is trained. It takes as input the multi-scale representation generated by the feature extractor, and performs a sequence of invertible transformations on the data, using the NF framework.

State-of-the-art methods following this approach are centered on designing or trying out different existing multi-scale feature extractors. In this work, we further improve the approach by not only proposing a new feature extractor, but also a multi-level deep feature integration method that aggregates the information from different scales using the well-known and widely used UNet-like [unet] architecture. The U-shape is composed of the feature extractor as encoder and the NF as decoder. We show in the ablation study of Section 5.1 that the U-shape is a beneficial aggregation strategy that further improves the results.

The NF in this phase uses only invertible operations, and it is implemented as one unique graph, therefore yielding a fully invertible network. Optimizing the whole flow at once has a crucial implication: the Normalizing Flow generates an embedding that is independent not only within each scale but also across scales, unlike CFlow [gudovskiy2022cflow] and FastFlow [yu2021fastflow]. It is worth mentioning that in all works so far, the anomaly score is computed first at each scale independently, using a likelihood-based estimation, and finally merged by averaging or summing over scales. Because of the lack of independence between scales, these final operations, although achieving very good performance, lack a formal probabilistic interpretation. The NF architecture proposed in this work overcomes this limitation; indeed, it produces statistically independent multi-scale features, for which the joint likelihood estimation becomes trivial.

Architecture. The U-shaped NF is composed of a number of flow stages, each corresponding to a different scale, whose size matches the extracted Transformer-based features (see Figure 2). For each scale, starting from the bottom (i.e. the coarsest scale), the input is fed into its corresponding flow stage, which is essentially a sequential concatenation of flow steps. The output of this flow stage is then split so that half of the channels are sent directly to the output of the whole graph, and the other half is up-sampled to be concatenated with the input of the next scale, proceeding in the same fashion. The up-sampling is also performed in an invertible way, as proposed in [jacobsen2018revnet]: pixels in groups of four channels are reordered so that the output volume has four times fewer channels, and twice the width and height.
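This invertible up-sampling is the inverse of the usual "squeeze" operation. A minimal NumPy sketch of the reordering (function names are ours, for illustration):

```python
import numpy as np

def invertible_upsample(x):
    """Rearrange a (C, H, W) tensor into (C/4, 2H, 2W) without losing information."""
    c, h, w = x.shape
    assert c % 4 == 0
    x = x.reshape(c // 4, 2, 2, h, w)   # split channels into 2x2 spatial offsets
    x = x.transpose(0, 3, 1, 4, 2)      # (C/4, H, 2, W, 2)
    return x.reshape(c // 4, 2 * h, 2 * w)

def invertible_downsample(y):
    """Exact inverse: (C, 2H, 2W) -> (4C, H, W)."""
    c, h2, w2 = y.shape
    y = y.reshape(c, h2 // 2, 2, w2 // 2, 2)
    y = y.transpose(0, 2, 4, 1, 3)      # (C, 2, 2, H, W)
    return y.reshape(4 * c, h2 // 2, w2 // 2)

x = np.arange(4 * 3 * 3, dtype=float).reshape(4, 3, 3)
assert np.array_equal(invertible_downsample(invertible_upsample(x)), x)
```

Because the operation is a pure permutation of entries, its Jacobian determinant is 1 and it contributes nothing to the likelihood computation.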

Each flow step has a size according to its scale, and is composed of Glow blocks [kingma2018glow]. Each step combines an Affine Coupling layer [dinh2016density-nvp], a permutation using 1×1 convolutions, and a global affine transformation (ActNorm) [kingma2018glow]. The Affine Coupling layer uses a small network with the following layers: a convolution, a ReLU activation, and one more convolution. Convolutions have the same number of filters as input channels, and kernels alternate between 1×1 and 3×3.
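The coupling mechanism at the heart of such a step can be sketched as follows. This is a simplified stand-alone affine coupling with an explicit inverse and log-determinant; the small convolutional subnet described above is replaced by a fixed random linear map, purely for illustration:

```python
import numpy as np

class AffineCoupling:
    """Minimal affine coupling sketch: half of the input parameterizes an
    affine transform of the other half. The log-determinant of the Jacobian
    is the sum of the log-scales."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((dim, dim // 2)) * 0.1  # stand-in subnet

    def _subnet(self, x1):
        h = self.W @ x1
        # first half -> log-scale s (squashed for stability), second half -> shift t
        return np.tanh(h[: len(h) // 2]), h[len(h) // 2 :]

    def forward(self, x):
        x1, x2 = np.split(x, 2)
        s, t = self._subnet(x1)
        y2 = x2 * np.exp(s) + t
        return np.concatenate([x1, y2]), s.sum()  # output, log|det J|

    def inverse(self, y):
        y1, y2 = np.split(y, 2)
        s, t = self._subnet(y1)          # same subnet: x1 == y1 passes through
        return np.concatenate([y1, (y2 - t) * np.exp(-s)])

layer = AffineCoupling(dim=8)
x = np.random.default_rng(1).standard_normal(8)
y, logdet = layer.forward(x)
assert np.allclose(layer.inverse(y), x)
```

Because the subnet only ever sees the untouched half, inversion is exact regardless of how complex the subnet is; this is the property that makes the whole U-shaped graph invertible.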

To sum up, the U-shaped NF produces L white Gaussian embeddings z_1, …, z_L, one for each scale l, with z_l ∈ R^{C_l×H_l×W_l}. Here we denote by (i, j) a pixel location in the input image, and by k the channel index in the latent tensor. Its elements z^l_{ijk} ∼ N(0, 1), 1 ≤ i ≤ H_l, 1 ≤ j ≤ W_l, 1 ≤ k ≤ C_l, are mutually independent for any position, channel and scale (i, j, k, l).

3.3 Phase 3: Anomaly score and segmentation

The last phase of the method is to be used at test time when computing the anomaly map and its corresponding segmentation. Thanks to the statistical independence of the features produced by the U-shaped NF, the joint likelihood of a test image under the anomaly-free image model is the product of the standard normal marginals.

More precisely, this third and last phase of our method produces two outputs (see Figure 2), described below.

3.3.1 Likelihood-based Anomaly Score

The first output is an anomaly map that associates to each pixel (i,j) in the test image, the same likelihood-based Anomaly Score proposed in FastFlow and CFlow,

AS_ij = −(1/L) ∑_{l=1}^{L} exp( −(1/(2 C_l)) ∑_{k=1}^{C_l} (z^l_{ijk})² )   (1)

Note that this score does not correspond exactly to the anomaly-free likelihood, mainly because it averages the (unnormalized) likelihood of each scale instead of computing their product. This of course lacks a clear probabilistic interpretation, but it produces high-quality anomaly maps and allows us to compare our approach to other state-of-the-art methods.
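Under the assumption that all scale embeddings have been resized to a common (H, W) grid, Eq. (1) can be computed as follows (a sketch, not the authors' implementation):

```python
import numpy as np

def anomaly_score(zs):
    """Eq. (1): minus the mean over scales of exp(-||z_ij||^2 / (2 C_l)),
    computed per pixel. Each z in zs has shape (C_l, H, W); we assume all
    scales were already resized to a common (H, W) grid."""
    per_scale = [np.exp(-0.5 * (z ** 2).mean(axis=0)) for z in zs]
    return -np.mean(per_scale, axis=0)

rng = np.random.default_rng(0)
zs = [rng.standard_normal((8, 4, 4)), rng.standard_normal((16, 4, 4))]
as_map = anomaly_score(zs)
assert as_map.shape == (4, 4)
assert np.all((as_map >= -1.0) & (as_map <= 0.0))  # score lies in [-1, 0]
```

The score is bounded in [−1, 0]: pixels whose latents are close to the origin (very likely under normality) score near −1, while large-magnitude latents push the exponential toward zero and the score toward 0.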

In order to obtain a more formal and statistically meaningful result, this third phase also produces a second anomaly score called Number of False Alarms (NFA), computed following the a contrario framework. Moreover, this framework makes it possible to derive an unsupervised detection threshold on the NFA, which produces an anomaly segmentation mask.

3.3.2 NFA: Number of False Alarms

The a contrario framework [desolneux2007gestalt] is a multiple hypothesis testing methodology that has been successfully applied to a wide variety of detection problems, such as alignments for line segment detection [von2008straight], image forgery detection [marina], and even anomaly detection [tailanian2022contrario-book], to name a few.

The a contrario framework is based on the non-accidentalness principle. Given that we do not usually know what the anomalies look like, it focuses on modeling normality, by defining a null hypothesis (H0), also called the background model. Relevant structures are detected as large deviations from this model, by evaluating how likely it is that an observed structure or event E would happen under H0. One of its most useful characteristics is that it allows one to automatically set a detection threshold, by controlling the NFA, defined as:

NFA(E) = N_T · P_{H0}(E)   (2)

where N_T is the number of events that are tested, and P_{H0}(E) (also noted simply P in the following) is the probability of the event E happening as a realization of the background model. Besides providing an automatic threshold, the NFA value itself has a clear statistical meaning: it is an estimate of the number of times, among the tests that are performed, that such a tested event could be generated by the background model, namely under normality assumptions. A low NFA value means that the observed pattern is too unlikely to be generated by the background model, and therefore indicates a possible anomaly.

In our anomaly detection setting, since the embeddings produced by U-Flow satisfy z^l_{ijk} ∼ N(0, 1), i.i.d., we characterize normality by defining a background model in which (z^l_{ijk})² follows a Chi-Squared distribution with one degree of freedom,

(z^l_{ijk})² ∼ χ²(1), i.i.d.   (3)

Then, instead of basing the criterion only on the observed pixel, we consider a block B^l_{ij} of size w_l × w_l × C_l centered on pixel (i, j) and spreading through all channels, as shown in Figure 3, and define a set of candidates S_{B^l_{ij}} as the voxels inside the block with suspiciously high values, according to the normality assumption given by (3):

S_{B^l_{ij}} = { (i′, j′, k) : (z^l_{i′j′k})² > τ }   (4)

with i′ ∈ [i − w_l/2, i + w_l/2], j′ ∈ [j − w_l/2, j + w_l/2], k ∈ [1, C_l], and τ fixed to represent a p-value of p = 0.9 under the χ²(1) distribution.
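The threshold τ and the candidate count of Eq. (4) can be obtained directly with SciPy; τ is the 0.9 quantile of the χ²(1) distribution (a sketch under this reading, with our own function name):

```python
import numpy as np
from scipy.stats import chi2

# tau is exceeded with probability 1 - p = 0.1 under chi2 with 1 d.o.f.
p = 0.9
tau = chi2.ppf(p, df=1)  # approximately 2.706

def count_candidates(z_block):
    """Candidates |S_B| in a (w, w, C) latent block: voxels whose squared
    value exceeds tau, as in Eq. (4)."""
    return int((z_block ** 2 > tau).sum())

# Under normality, roughly 10% of the voxels should be candidates:
rng = np.random.default_rng(0)
block = rng.standard_normal((5, 5, 64))
fraction = count_candidates(block) / block.size
assert 0.05 < fraction < 0.15
```

Note that τ = (Φ⁻¹(0.95))² ≈ 1.645² ≈ 2.706, since the square of a standard normal is χ²(1).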

Figure 3: Tested regions for the NFA computation. Detection is based on the concentration of candidates, marked as gray solid points inside the volumes (the outputs of the NF phase). Blocks in red (B^l_{i1 j1}) represent an anomalous region, while blocks in green (B^l_{i2 j2}) represent a normal concentration of candidates.

Under the normality assumption, candidates are distributed uniformly. We therefore base our anomaly detection on the concentration of these candidates. Given an observed number |S_{B^l_{ij}}| of such points within B^l_{ij}, the probability that the background model produces at least this number of points within B^l_{ij}, which we denote by P^l_{ij}, is given by the tail of the Binomial distribution: P^l_{ij} = B(|S_{B^l_{ij}}|, w_l² C_l, 1 − p). However, as the number of tests is usually very large, and due to numerical instabilities, we instead evaluate its channel-wise average:

P^l_{ij} = B( |S_{B^l_{ij}}| / C_l, w_l², 1 − p )   (5)

Note that we want to evaluate the Number of False Alarms, which is an upper bound on the number of false detections, and therefore the average value |S_{B^l_{ij}}|/C_l is a conservative estimate.
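The per-scale tail probability of Eq. (5) can be evaluated with SciPy's Binomial survival function (a sketch; the function name is ours):

```python
import math
from scipy.stats import binom

def scale_tail_probability(n_candidates, C, w, p=0.9):
    """P^l_ij of Eq. (5): tail of the Binomial for the channel-averaged
    candidate count |S|/C over the w*w positions of the block, where each
    position is a candidate with probability 1 - p under H0."""
    k = n_candidates / C
    # P(X >= k) for X ~ Binomial(w*w, 1-p): survival function at ceil(k) - 1
    return binom.sf(math.ceil(k) - 1, w * w, 1 - p)

# A block crowded with candidates is far less probable under H0 than a
# typical one (expected candidates per channel: 0.1 * 25 = 2.5):
p_crowded = scale_tail_probability(20 * 64, C=64, w=5)
p_typical = scale_tail_probability(3 * 64, C=64, w=5)
assert p_crowded < p_typical
```

Working with the survival function (rather than 1 minus the CDF) avoids the catastrophic cancellation that motivates the channel-averaging in the first place.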

Finally, we merge the results of the different scales using the independence resulting from the NF: P_ij = ∏_{l=1}^{L} P^l_{ij}. The number of tests we perform is N_T = ∑_l H_l W_l. As a result, following (2) and (5), the logarithm of the NFA value for pixel (i, j) is given by:

log(NFA_ij) = log( ∑_l H_l W_l ) + ∑_l log B( |S_{B^l_{ij}}| / C_l, w_l², 1 − p )   (6)

4 Results

| Category | PatchSVDD [yi2020patch-svdd] | PaDiM [defard2021padim] | CutPaste [li2021cutpaste] | PatchCore [roth2022towards-patchcore] | FastFlow [yu2021fastflow] | CFlow [gudovskiy2022cflow] | U-Flow (ours) |
|---|---|---|---|---|---|---|---|
| Carpet | (92.90, 92.60) | (-, 99.10) | (100.0, 98.30) | (98.70, 98.90) | (100.0, 99.40) | (100.0, 99.25) | (100.0, 99.42) |
| Grid | (94.60, 96.20) | (-, 97.30) | (96.20, 97.50) | (98.20, 98.70) | (99.70, 98.30) | (97.60, 98.99) | (99.75, 98.49) |
| Leather | (90.90, 97.40) | (-, 98.90) | (95.40, 99.50) | (100.0, 99.30) | (100.0, 99.50) | (97.70, 99.66) | (100.0, 99.59) |
| Tile | (97.80, 91.40) | (-, 94.10) | (100.0, 90.50) | (98.70, 95.60) | (100.0, 96.30) | (98.70, 98.01) | (100.0, 97.54) |
| Wood | (96.50, 90.80) | (-, 94.90) | (99.10, 95.50) | (99.20, 95.00) | (100.0, 97.00) | (99.60, 96.65) | (99.91, 97.49) |
| Av. texture | (94.54, 93.68) | (-, 96.86) | (98.14, 96.26) | (98.96, 97.50) | (99.94, 98.10) | (98.72, 98.51) | (99.93, 98.51) |
| Bottle | (98.60, 98.10) | (-, 98.30) | (99.90, 97.60) | (100.0, 98.60) | (100.0, 97.70) | (100.0, 98.98) | (100.0, 98.65) |
| Cable | (90.30, 96.80) | (-, 96.70) | (100.0, 90.00) | (99.50, 98.40) | (100.0, 98.40) | (100.0, 97.64) | (98.97, 98.61) |
| Capsule | (76.70, 95.80) | (-, 98.50) | (98.60, 97.40) | (98.10, 98.80) | (100.0, 99.10) | (99.30, 98.98) | (99.56, 99.02) |
| Hazelnut | (92.00, 97.50) | (-, 98.20) | (93.30, 97.30) | (100.0, 98.70) | (100.0, 99.10) | (96.80, 98.89) | (99.71, 99.30) |
| Metal nut | (94.00, 98.00) | (-, 97.20) | (86.60, 93.10) | (100.0, 98.40) | (100.0, 98.50) | (91.90, 98.56) | (100.0, 98.82) |
| Pill | (86.10, 95.10) | (-, 95.70) | (99.80, 95.70) | (96.60, 97.10) | (99.40, 99.20) | (99.90, 98.95) | (98.80, 99.35) |
| Screw | (81.30, 95.70) | (-, 98.50) | (90.70, 96.70) | (98.10, 99.40) | (97.80, 99.40) | (99.70, 98.86) | (96.31, 99.49) |
| Toothbrush | (100.0, 98.10) | (-, 98.80) | (97.50, 98.10) | (100.0, 98.70) | (94.40, 98.90) | (95.20, 98.93) | (91.39, 98.79) |
| Transistor | (91.50, 97.00) | (-, 97.50) | (99.80, 93.00) | (100.0, 96.30) | (99.80, 97.30) | (99.10, 97.99) | (99.92, 97.87) |
| Zipper | (97.90, 95.10) | (-, 98.50) | (99.90, 99.30) | (98.80, 98.80) | (99.50, 98.70) | (98.50, 99.08) | (98.74, 98.60) |
| Av. objects | (90.84, 96.72) | (-, 97.79) | (96.61, 95.82) | (99.11, 98.32) | (99.09, 98.63) | (98.04, 98.69) | (98.34, 98.85) |
| Av. total | (92.07, 95.71) | (-, 97.48) | (97.12, 95.97) | (99.06, 98.05) | (99.37, 98.45) | (98.27, 98.63) | (98.87, 98.74) |

Table 1: ROC-AUC results in the format (image-level, pixel-level). Our method achieves state-of-the-art performance in both detection (image-level) and localization (pixel-level) metrics, ranking first on the localization task. Due to space constraints, both results had to be condensed in one table, and we excluded comparisons with SPADE and PEFM (which were among the least performing ones). Separate localization and detection tables, including these two methods, can be found in the supplementary material.

| Cat | FastFlow (oracle) | CFlow (oracle) | Ours NFA (oracle) | FastFlow (fair) | CFlow (fair) | Ours AS (fair) | Ours NFA (fair) |
|---|---|---|---|---|---|---|---|
| Car | 0.474 | 0.380 | 0.571 | 0.315 | 0.333 | 0.331 | 0.566 |
| Gri | 0.300 | 0.263 | 0.249 | 0.274 | 0.125 | 0.360 | 0.226 |
| Lea | 0.388 | 0.391 | 0.419 | 0.349 | 0.344 | 0.042 | 0.333 |
| Til | 0.553 | 0.392 | 0.610 | 0.418 | 0.286 | 0.617 | 0.600 |
| Woo | 0.412 | 0.432 | 0.478 | 0.279 | 0.250 | 0.182 | 0.467 |
| Tex | 0.425 | 0.379 | 0.466 | 0.327 | 0.268 | 0.306 | 0.438 |
| Bot | 0.667 | 0.735 | 0.620 | 0.494 | 0.737 | 0.232 | 0.615 |
| Cab | 0.375 | 0.230 | 0.554 | 0.375 | 0.185 | 0.285 | 0.548 |
| Cap | 0.276 | 0.235 | 0.229 | 0.242 | 0.219 | 0.108 | 0.210 |
| HNu | 0.559 | 0.539 | 0.541 | 0.558 | 0.244 | 0.279 | 0.539 |
| MNu | 0.457 | 0.390 | 0.581 | 0.428 | 0.325 | 0.403 | 0.548 |
| Pil | 0.501 | 0.370 | 0.424 | 0.471 | 0.338 | 0.245 | 0.411 |
| Scr | 0.347 | 0.229 | 0.363 | 0.315 | 0.201 | 0.134 | 0.344 |
| Too | 0.438 | 0.387 | 0.636 | 0.326 | 0.381 | 0.411 | 0.632 |
| Tra | 0.470 | 0.429 | 0.721 | 0.071 | 0.147 | 0.074 | 0.721 |
| Zip | 0.488 | 0.376 | 0.498 | 0.315 | 0.150 | 0.284 | 0.481 |
| Obj | 0.459 | 0.392 | 0.517 | 0.360 | 0.293 | 0.245 | 0.505 |
| Tot | 0.448 | 0.388 | 0.500 | 0.349 | 0.284 | 0.266 | 0.483 |

Table 2: Segmentation IoU comparison with the best-performing methods in the literature, FastFlow and CFlow, for the oracle-like and fair thresholds defined in Section 4. Our method largely outperforms all others, even when comparing the proposed automatic threshold with their oracle-like threshold.

We evaluate our method on MvTec-AD [bergmann2019mvtec], the most widely used benchmark for anomaly detection, as most state-of-the-art anomaly detection methods report their performance on this dataset. MvTec-AD consists of 15 categories of textures and objects, simulating real-world industrial quality inspection scenarios, with more than 5000 images in total. Each category has normal (i.e. anomaly-free) images for training, and a test set with different kinds of anomalies as well as additional normal images.

For a fair comparison, we adopt the ROC-AUC metric (the area under the Receiver Operating Characteristic curve), as it is the most widely used metric in the literature. The ROC-AUC results at the pixel level (i.e. anomaly localization) and image level (i.e. anomaly detection) are presented in Table 1. For the pixel-level metric, U-Flow achieves state-of-the-art results, outperforming all previous methods on average. Regarding the image-level results, we simply used as image score the maximum value of the pixel-level anomaly score, achieving state-of-the-art results.
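As a minimal sketch (the function name is ours), this image-level scoring rule amounts to a one-liner:

```python
import numpy as np

def image_level_score(anomaly_map):
    """Image-level anomaly score: the maximum pixel-level anomaly score."""
    return float(np.max(anomaly_map))

assert image_level_score(np.array([[0.1, 0.9], [0.3, 0.2]])) == 0.9
```

Taking the maximum makes the image-level decision sensitive to even a single strongly anomalous region, which suits industrial inspection where small defects still reject a part.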

In addition, besides obtaining excellent results for both pixel and image-level ROC-AUC, our method presents another significant advantage with respect to all others: it produces a fully unsupervised segmentation of the anomalies, and significantly outperforms its competitors in terms of IoU, as shown in the next section.

4.1 Segmentation results

Providing an operating point is crucial in almost any industrial application. Most methods in the literature focus on the ROC-AUC metric and do not provide detection thresholds or anomaly segmentation masks. Some methods do provide segmentation results, but these are not consistent with a realistic scenario: they are obtained using an oracle-like threshold computed by maximizing some metric over the test images [roth2022towards-patchcore, tsai2022multi-mspba, gudovskiy2022cflow].

In this section we report anomaly segmentation results based on an unsupervised threshold obtained by setting NFA = 1 (log(NFA) = 0). As explained in Section 3.3.2, this threshold means that, in theory, we allow at most, on average, one falsely detected pixel per image. Results for the IoU metric are shown in Table 2. We limit this comparison to the two top-performing methods. As the state-of-the-art methods we compare against do not provide detection thresholds, we adopt two strategies: (i) we compute an oracle-like threshold that maximizes the IoU over the test set, and (ii) we use a fair strategy that only uses training data to find the threshold. In the latter, the threshold is set to allow at most one false positive in each training image, which is analogous to setting NFA = 1 false alarm. For completeness, we also include in the table ("Ours AS" column) the results of applying this fair strategy to our anomaly score computed following (1). As can be seen, our automatic thresholding strategy significantly outperforms all others, even when compared with their oracle-like threshold.

It is worth mentioning that, since the trained models for FastFlow and CFlow are not available, we needed to re-train them for all categories. To ensure a fair comparison, we performed multiple training runs until we obtained ROC-AUC values comparable to the ones reported in the original papers. Note that for these methods, a different set of hyper-parameters is chosen for each category, sometimes varying even the number of flow steps and the image resolution, while for our method the architecture is always kept the same.

Finally, Figure 5 presents typical example results from various categories, compared with other methods, for visual evaluation.

4.2 Implementation and complexity details

The method was implemented in PyTorch [NEURIPS2019_9015-pytorch], using PyTorch Lightning [Falcon_PyTorch_Lightning_2019]. The NFs were implemented using the FrEIA framework [freia], and for all tested feature extractors we used PyTorch Image Models [rw2019timm]. All trainings were performed on a GeForce RTX 2080 Ti.

For the multi-scale CaIT version, the input sizes are 448 and 224 pixels in width and height. The Normalizing Flow has 2 flow stages with 4 flow steps each. For computing the NFA, we used p=0.9 both for the Chi-Squared and Binomial distributions, and block sizes w1=5 and w2=3 for the fine and coarse scales, respectively.

Our method uses only four flow steps at each scale. As a result, it has fewer trainable parameters than FastFlow and CFlow, as shown in Table 3.

5 Ablation Study

In this section we study and provide unbiased assessments of some of the contributions of this work: the significance of the U-shaped architecture and the benefits of the multi-scale Transformer feature extractor. Both results are shown together in Table 4 and explained in the following sections.

Also, to get some insights and visualize the distinction between normal and abnormal samples, we include Figure 4, which displays the distribution of the NFA-based anomaly score (−logNFA) for both normal and anomalous pixels. The solid vertical line represents the automatic threshold (logNFA=0), and a dashed line the oracle-like threshold, i.e. the best possible threshold for IoU on the test set. The distributions are clearly separated, and the automatic threshold is located very close to the optimal point, supporting the a contrario thresholding strategy.

5.1 Ablation: U-shape

One of the contributions of this work is a multi-level integration mechanism, obtained by introducing the well-known U-shaped architecture design into the Normalizing Flow framework. To demonstrate that this architecture better integrates the information of the different scales, we compare the ROC-AUC results against a modification of the architecture in which each flow stage runs in parallel and the per-scale anomaly maps are merged at the end by simply averaging them, as done by other methods such as FastFlow [yu2021fastflow]. Additionally, we compare the results obtained using each scale separately. The results, presented in the top half of Table 4, show that the U-merging strategy improves the performance in almost all cases. Furthermore, this strategy, where the output of one scale feeds the next one, allows using fewer flow steps per scale, resulting in a network with fewer parameters, as shown in Section 4.2.

5.2 Ablation: feature extraction with multi-scale Transformer

| Feature Extractor | FastFlow | CFlow | Ours |
|---|---|---|---|
| ResNet18 | 4.9 M | 5.5 M | 4.3 M |
| WideResnet50-2 | 41.3 M | 81.6 M | 34.8 M |
| CaIT M48 | 14.8 M | 10.5 M | 8.9 M |
| MS-CaIT | - | - | 12.2 M |

Table 3: Complexity analysis: comparison of the number of trainable parameters for different feature extractors.

Figure 4: NFA-based anomaly score (−log NFA) distribution for the bottle (left) and metal nut (right) classes. Blue corresponds to normal pixels, and red to anomalous ones. The vertical line stands for log NFA = 0, which corresponds nearly to the optimal point. Plots for other classes are included in the Appendix.

| | Carp. | Grid | Leat. | Tile | Wood | Bott. | Cable | Caps. | H-nut | M-nut | Pill | Screw | Toot. | Tran. | Zipp. | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Scale 1 | 99.08 | 97.40 | 99.32 | 94.97 | 93.45 | 98.33 | 98.11 | 98.87 | 99.09 | 97.89 | 98.67 | 99.44 | 98.74 | 96.46 | 98.55 | 97.89 |
| Scale 2 | 99.12 | 97.09 | 99.42 | 96.47 | 96.26 | 97.23 | 97.60 | 98.07 | 98.63 | 97.86 | 98.62 | 99.18 | 97.86 | 97.21 | 97.29 | 97.86 |
| Average | 99.44 | 98.25 | 99.52 | 97.27 | 96.40 | 98.61 | 98.50 | 98.85 | 99.16 | 98.29 | 99.12 | 99.50 | 98.78 | 97.66 | 98.69 | 98.54 |
| Ours | 99.42 | 98.49 | 99.59 | 97.54 | 97.49 | 98.65 | 98.61 | 99.02 | 99.30 | 98.82 | 99.35 | 99.49 | 98.79 | 97.87 | 98.60 | 98.74 |
| ResNet | 98.80 | 98.26 | 99.37 | 94.53 | 94.50 | 98.00 | 96.96 | 98.46 | 98.63 | 96.70 | 97.45 | 98.01 | 98.20 | 98.38 | 97.43 | 97.58 |
| Arch. | wide | r18 | r18 | wide | r18 | r18 | wide | wide | wide | wide | wide | wide | r18 | wide | wide | - |
| F. steps | 6 | 6 | 4 | 8 | 6 | 4 | 4 | 4 | 4 | 4 | 4 | 6 | 6 | 4 | 4 | - |

Table 4: Ablation results. The "Ours" row, which shows the pixel-level ROC-AUC results of the proposed method (U-Flow), serves as the comparison point for the top and bottom parts of the table. Top part: ablation study for the scale-merging strategy, with results using the anomaly scores generated by each scale independently, and a naive way of merging them (average). Bottom part: ablation study for the feature extractor; the proposed method (which uses a multi-scale CaIT) is compared with ResNet feature extractors. Each category shows the ROC-AUC of the best variant we could obtain, varying several hyper-parameters such as the architecture (wide-resnet-50 or resnet-18) and the number of flow steps.

Figure 5: Example results for several categories, compared with FastFlow and CFlow in the second and third rows. The last three rows correspond to our method: the anomaly score defined in (1), the NFA-based anomaly score (−log NFA), and the segmentation obtained with the automatic threshold log NFA = 0.

